Recipe for Tunisian_MSA corpus. #2722
johnjosephmorgan wants to merge 325 commits into kaldi-asr:master
…ything under data/local
…l number of jobs to 1.
…-chunk-per-minibatch=64,32,16
…l number of jobs to 1.
…nstead of data/lang in mkgraph command
It looks like I was not able to resolve the conflicts. I accept help :)

OK @xiaohui-zhang and @huangruizhe will look at it.

The conflicts seem to be due to changes to files in the heroico recipe that don't seem to be related to the Tunisian MSA recipe.

done in #2725
…oded Arabic to utf8 encoded arabic.
…buckwalter to utf8.
@@ -0,0 +1,10 @@
#!/bin/bash

cut -d " " -f 1 qcri.txt > qcri_words_buckwalter.txt
this should probably be
cat qcri.txt | tail -n +4 | cut -d " " -f 1 > qcri_words_buckwalter.txt
cat qcri.txt | tail -n +4 | cut -d " " -f 2 > qcri_prons.txt
Otherwise lines like "# Copyright" will be included.
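For reference, a minimal sketch of the header-skipping step suggested above. The sample file is hypothetical (it is not taken from the QCRI lexicon), and the three-line header is an assumption inferred from the `tail -n +4` in the suggestion:

```shell
#!/bin/bash
# Hypothetical stand-in for qcri.txt: three header lines, then
# "word phone phone ..." entries separated by single spaces.
tmpdir=$(mktemp -d)
cat > "$tmpdir/qcri.txt" <<'EOF'
# Copyright (placeholder header line)
# placeholder header line 2
# placeholder header line 3
ktb k a t a b a
byt b a y t
EOF

# tail -n +4 starts printing at line 4, so the first three lines are dropped
# before cut extracts the first space-separated field (the word).
tail -n +4 "$tmpdir/qcri.txt" | cut -d " " -f 1 > "$tmpdir/qcri_words_buckwalter.txt"

# Keep the results in variables, then clean up the temporary directory.
words=$(cat "$tmpdir/qcri_words_buckwalter.txt")
header_count=$(grep -c '^#' "$tmpdir/qcri_words_buckwalter.txt")
rm -rf "$tmpdir"

printf '%s\n' "$words"
```

If the header length might change between lexicon releases, filtering with `grep -v '^#'` instead of a fixed line count would be one alternative, assuming no real entry starts with `#`.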
|
I was already doing this in the download script:
egs/Tunisian_msa/s5/local/qcri_lexicon_download.sh
I wasn't overwriting the unzipped text file with the header removed,
so I fixed that.
J
#!/bin/bash

cut -d " " -f 1 qcri.txt > qcri_words_buckwalter.txt
cut -d " " -f 2 qcri.txt > qcri_prons.txt
Actually the option -f 2 should be -f 2-. I realized this after I saw totally wrong decoding results...
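To illustrate the difference, here is a small sketch on a single made-up lexicon entry (the entry is hypothetical; the word-then-phones layout is assumed from the `cut -d " "` calls in this script). With `-f 2`, `cut` keeps only the second field, so every pronunciation is truncated to its first phone; `-f 2-` keeps field 2 through the end of the line:

```shell
#!/bin/bash
# Hypothetical entry: a word followed by its space-separated phones.
line="ktb k a t a b a"

# -f 2 selects only the second field, i.e. just the first phone.
first_only=$(printf '%s\n' "$line" | cut -d " " -f 2)

# -f 2- selects the second field and everything after it, i.e. the full pronunciation.
full_pron=$(printf '%s\n' "$line" | cut -d " " -f 2-)

echo "-f 2  gives: $first_only"
echo "-f 2- gives: $full_pron"
```

A lexicon built from the truncated output would map every word to a single phone, which would be consistent with the badly wrong decoding results mentioned above.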
I'm closing this PR because I believe we merged it indirectly via #2725. If there are other changes you want us to merge, please let us know.
|
I removed 2 scripts that are not used and 2 config files. And I added a copyright.
conf/pitch.conf
conf/plp.conf
local/buckwalter2utf8.pl
local/qcri_buckwalter2utf8.pl
John
@xiaohui-zhang would you mind making a PR with the latest changes, if appropriate? No hurry.
|
sure!
--
Xiaohui
A recipe to build an ASR system with the Tunisian_MSA corpus of Arabic.